Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms

Abstract

In high-performance computing environments, input/output (I/O) operations from various sources often contend for scarce available bandwidth. In addition to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) places an additional burden, as it increases I/O contention and leads to degraded performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications that share valuable I/O resources. First, we provide a theoretical model and then derive a set of necessary constraints needed to minimize the global waste on the platform. Our results demonstrate that the optimal checkpoint interval as defined by Young/Daly, while providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement in overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance.
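For readers unfamiliar with the Young/Daly interval mentioned in the abstract, the following is a minimal sketch of the classical first-order formula for a single application, T = sqrt(2 * C * MTBF), together with the first-order waste it minimizes. It is not the cooperative model of the paper. The function names, the example values (a 60 s checkpoint and a one-day MTBF), and the simplified waste expression that ignores downtime and recovery are all assumptions made for illustration.

```python
import math


def young_daly_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order Young/Daly checkpointing period, in seconds.

    checkpoint_cost_s: time to write one checkpoint (C).
    mtbf_s: mean time between failures seen by the application.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(period_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of time wasted for a given period.

    Assumed simplified model: checkpoint overhead C/T plus expected
    re-execution T/(2 * MTBF); downtime and recovery are ignored.
    """
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    C = 60.0          # hypothetical checkpoint cost: 60 s
    MTBF = 86400.0    # hypothetical MTBF: one day
    T = young_daly_period(C, MTBF)
    print(f"period = {T:.1f} s, waste = {first_order_waste(T, C, MTBF):.4f}")
```

This period is computed for each application in isolation, assuming its checkpoint cost is fixed. Under I/O contention the effective checkpoint cost depends on what other applications are doing, which is the gap the abstract points to.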
